Communication event

ABSTRACT

In a communication event between a first user and one or more second users via a communication network, a plurality of video streams is received via the network. Each of the streams carries a moving image of at least one respective user. The moving image of a first of the video streams is displayed at a user device of the first user for a first time interval. In the moving image of a second of the video streams that is not displayed at the user device in the first time interval, a human feature of the respective user is identified. A movement of the identified human feature during the first time interval that matches one of a plurality of expected movements is detected. In response to the detected movement, at least the moving image of the second video stream is displayed at the user device for a second time interval.

BACKGROUND

Voice over internet protocol (“VoIP”) communication systems allow the user of a device to make calls across a communication network. To use VoIP, the user must install and execute client software on their device. The client software provides the VoIP connections as well as other functions such as registration and authentication. Advantageously, in addition to voice communication, the client may also provide video calling and instant messaging (“IM”). With video calling, the callers are able to view video images (i.e. moving images) of the other party in addition to voice information. This enables a much more natural communication between the parties, as facial expressions are also communicated, thereby making video calls more comparable to a face-to-face conversation.

A video call comprising multiple users may be referred to as a “video conference”. In a conventional video conference, each participant (i.e. user) is able to view the video images of one or more of the other participants (users) in the video conference. For example, as a default setting, each user may be presented with the video images of all of the other users in the video conference. These may be displayed, for example, using a grid, with each video image occupying a different location on the grid. Alternatively, each user may be presented with one or more video images corresponding to users that have been detected as speaking users. That is, the detection of audio from a speaker may determine which of the video images of the other users are selected for display at a particular user's user terminal. Typically, in a video conference, one user speaks at a time, and so this may result in a single video image of that user being displayed to each of the non-speaking users.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Various aspects of the present subject matter relate to a communication event between a first user and one or more second users via a communication network. A plurality of video streams is received via the network at a computer connected to the network. Each of the streams carries a respective moving image of at least one respective user. The computer causes the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval. The computer identifies, in the respective moving image of a second of the video streams that is not displayed at the user device in the first time interval, a human feature of the respective user. The computer detects a movement of the identified human feature during the first time interval that matches one of a plurality of expected movements. In response to the detected movement, the computer causes at least the respective moving image of the second video stream to be displayed at the user device for a second time interval.

DESCRIPTION OF FIGURES

For a better understanding of the present subject matter, and to show how the same may be carried into effect, reference is made to the following figures in which:

FIG. 1 shows a schematic block diagram of a communication system;

FIG. 2 shows a schematic block diagram of a user device;

FIG. 3 shows a functional block diagram of a server;

FIG. 4 shows a schematic illustration of a computer-implemented database;

FIG. 5 shows a flow chart for a method of selecting video streams for displaying at a user device during a video call, based on the detection of an expected movement;

FIGS. 6A, 6B, 6C, 7A, 7B, 8A, and 8B show various illustrations of a graphical user interface of a client, at different stages during a video call between a group of three or more users;

FIG. 9 schematically illustrates selectable predetermined layouts.

DESCRIPTION OF EMBODIMENTS

In a video conference conducted via a communication network, it may not always be desirable to display, to a particular user, all of the video images of the other users in the video conference. This may be the case, for example, where only a few of the users are active—i.e. doing something that may be of interest to one or more of the other users in the video conference and/or if there is a very large number of users on the call. For example, only one or two of the users may be speaking users. It may be desirable to prevent the video images associated with the remaining users, i.e. the inactive users, from being displayed at a user's terminal. This ensures that the user terminal does not allocate display resources to video data that does not add to the user's experience of the video conference. This is particularly, though not exclusively, applicable to mobile, tablet or certain laptop devices with limited available display areas. It may also in some cases ensure that network bandwidth is not allocated to transmitting the video streams associated with inactive users to other user terminals unnecessarily, as discussed in further detail below.

This is referred to herein as “follow the action storytelling”, and guides the consuming participants with the group activity and group response as the communication event proceeds. Currently, with active speaking video conferences, the consuming participants may need to monitor multiple video feeds at once to determine where non-verbal activity is occurring. Alternatively, if they are only viewing an active speaker, they may not be aware of non-verbal changes in the group activity due to the absence of any suitable visual representation being presented to them. In storied experiences, video may be displayed or sequenced according to the current action at hand. For “follow the action” storytelling, a combination of multi-video grids and single video grids can be used, depending on the size of the group activity and the number of sensors capturing the event.

It may not always be desirable to only display video images of users that are identified as speaking (i.e. displaying video images based exclusively on the detection of verbal events). For example, a user may be interested in a non-verbal event associated with one or more of the other users. This may include, for example, a change in the facial expression of one or more of the other users, which may have occurred as a result of one of the users reacting to a speaking user's speech. It may be desirable to display one or more of these reactions, as and when they occur, so as to enable a user to view the activity of the other users in the video conference, in a story-like manner. These reactions may be displayed for a limited time interval; for example, to ensure that a user's focus is not taken away from a speaking user for too long.

Furthermore, given that a non-verbal event is not associated with audio data (i.e. there is no speech to detect), it may not be desirable to replace both the audio and video data associated with a speaking user with that of a user associated with a non-verbal event. For example, it may be desirable to ensure that a speaking user's speech is still played out at a user's user terminal, even if the video image that is being displayed at that user's user terminal does not correspond to the video image of the speaking user (e.g. if the video image corresponds to a user reacting to the speaking user's speech). Treating the audio and video data in this way ensures story continuity of the group experience—i.e. that a user's focus is brought to the relevant audio and video data, at the right time in the video conference.

The present disclosure addresses these issues by providing a communication system that causes one or more video images of a video conference to be displayed at a user terminal in a virtual “Storied Experience View”. The virtual “Storied Experience View” harnesses the power of video and storytelling to transform a meeting experience (i.e. video conference) beyond active speaking via a more engaging and life-like meet up experience.

The Storied Experience View may comprise a single streaming video grid or a multi-streaming video grid where multiple video and/or audio channels play at one time. By displaying video images in the Storied Experience View, users are able to consume the most engaging and telling story of group activity, i.e. without having to monitor all of the video images of all of the other users in the video conference in order to determine where non-verbal user activity is occurring. In the Storied Experience View, video may be displayed or sequenced according to the current action at hand, using a combination of multiple video grids and single video grids depending on the size of the group activity and the number of sensors capturing the event.

In the present disclosure, one computer receives all of the video streams from each of the respective users via the network, so that an intelligent decision about which to display can be made taking into account all of their visual content. The computer has visibility of all of the candidate streams and is able to limit the number of these that are selected for display at a particular user terminal, taking into account non-verbal events, i.e. changes in the visual content of the moving images carried by the streams. Because the computer receives all of the video streams via the network, it is best placed to make intelligent decisions about which video streams to select. Limiting the number of video streams in this way is useful where, for example, a user terminal has a limited display area. In such a case, it may not be meaningful to display all of the video images of the other users in the video conference at that user terminal (particularly if the video conference has a large number of participants). The computer receiving the streams is able to work within the confines of the limited display area whilst maximizing the information that is delivered to the consuming user.

This is particularly, though not exclusively, the case where at least two of the video streams are received from different clients running on different client devices, as each individual client is not necessarily aware of the visual content of the other client's video stream(s).

In the described embodiments, the computer is embodied in a central server. This allows bandwidth to be saved, as only the stream(s) selected for displaying to any given client need to be transmitted to that client from the server. In this way, the server is able to use bandwidth efficiently, whilst maximizing the amount of useful and/or engaging information that is conveyed to a consuming user. In the present disclosure, a duration timer is assigned to non-verbal singular events. Upon detection of a non-verbal event, the video is promoted and assigned a duration and priority in the active video stack of the live story view sequence during a video call, providing activity awareness of the group's non-verbal communication to the remote consuming attendees' video sequence for playback.

The present disclosure allows virtual attendees, during live playback of video-based meet up experiences, to track participant non-verbal communication activity and awareness during the story video view experience on the stage. A duration is assigned to the non-verbal communication priority item for story view experiences, resulting in a right place and right time for the activity to surface in the story view. This increases participant engagement, activity and spatial awareness of the users.

In addition to live video in video calls, the present techniques may also be applied to recorded video of such calls at a later time.

Herein, references to users being currently “visible” in a moving image (or similar) carried by a video stream do not necessarily mean that the video image is currently being viewed. A user can be visible in a moving image that is not currently being displayed, in the sense that they are detectable in the visual content of that image based on computer-implemented image processing applied to the moving image, such as facial or skeleton tracking applied to the image by a computer. The visual content of a moving image means information that is conveyed by pixel values of the moving image, and which would thus be conveyed to a viewer were that moving image to be displayed (i.e. played out) to him. Thus, in accordance with the present techniques, it is ultimately changes in those pixel values—and in particular a change in the information that is conveyed by the changing pixel values—that cause certain video images to be selected for displaying for appropriately chosen intervals to convey the information change to one or more call participants. Each such change in the information conveyed by the visual content of a moving image is referred to individually herein as a “non-verbal singular event”, which includes for example changes in the number of users in the moving image and certain expected (i.e. recognizable) movements by a user in the moving image.

A moving image is also referred to herein as a “video image”, and means a sequence of frames (i.e. static images) to be played out in quick succession to give the impression of movement. Unless otherwise indicated, any references to “an image” below denote a moving image in this sense, rather than a static image. References to “displaying a video stream” mean displaying the moving image carried by that video stream.

FIG. 1 shows a communication system 100 comprising a first user 4 a (User A) who is associated with a first user terminal 6 a, a second user 4 b (User B) who is associated with a second user terminal 6 b, a third user 4 c (User C) associated with a third user terminal 6 c and a fourth user 4 d (User D) associated with a fourth user terminal 6 d. Whilst only four users have been shown in FIG. 1, it will be appreciated that the communication system 100 may comprise any number of users and associated user devices. It will also be appreciated that, whilst each user terminal 6 is shown with an associated camera device, 7, one or more of the user terminals may be associated with one or more additional cameras or sensors (e.g. microphone array, Kinect etc.), thereby allowing more streams of input from that location. For example, user terminal 6 b is shown to have an additional camera device 9. The additional camera device 9 may provide an alternative angle from which to capture a video image of user 4 b (and/or user 4 e).

More generally, one or more peripheral devices, such as external cameras, audio mics, motion sensors etc., may be connected to the network. These can be checked in or added to a specific parent device location via Bluetooth, WiFi, network login etc. These peripheral devices may act as added sensors or user preference inputs. Sensor coverage (i.e. the time at which particular sensors are activated) may be constrained so as to cover a storied event at the right place and time. For example, a standard type of storied experience may include “chapters” or “phases”, such as “start”, “story”, “end”, “manage” and “relive”. These chapters or phases may be used to manage the priorities and coverage of behaviour so as to ensure that such behaviour is captured at the appropriate times.

The user terminals 6 a, 6 b, 6 c and 6 d can communicate over the network 2 in the communication system 100, thereby allowing the users 4 a, 4 b, 4 c and 4 d to communicate with each other over the network 2. The network 2 may be any suitable network that has the ability to provide a communication channel between user terminals 6 a, 6 b, 6 c and 6 d. For example, the network 2 may be the Internet or another type of network such as a high data rate mobile network, such as a 3rd generation (“3G”) mobile network.

The user terminals 6 a, 6 b, 6 c and 6 d can be any type of user device such as, for example, a mobile phone, a personal digital assistant (“PDA”), a personal computer (“PC”) (including, for example, Windows™, Mac OS™ and Linux™ PCs), a gaming device (e.g. Xbox), a group room meeting device (e.g. Surface Hub) or other embedded device able to connect to the network 2. Each user terminal is arranged to receive information from and output information to one or more of the other user terminals. In one embodiment, each user terminal comprises a display such as a screen and an input device such as a keypad, a touch-screen, camera device and/or a microphone.

User terminals 6 a, 6 b, 6 c and 6 d each execute a communication client application provided by a software provider associated with the communication system. The communication client application is a software program executed on a local processor in the respective user terminal. The communication client application performs the processing required at the respective user terminal in order for each user terminal to transmit and receive video data (carried in the form of video streams) over the network 2. Each user terminal is connected to the network 2.

The communication client application is a videoconferencing application that enables users 4 a, 4 b, 4 c and 4 d to participate in a video conference. The communication client application provides a means through which each user can share any video data captured at their user device (e.g. by an associated camera device, such as those shown at 7 a, 7 b, 7 c and 7 d of FIG. 1) with one or more of the other users. The communication client application also provides a means through which each user can receive, at their respective user terminal, the video data captured by the other participants of the video conference.

For example, a user, such as user A, may initiate the video conference by transmitting a request to one or more other users, such as users B, C and D. Upon accepting the request from user A, users B, C and D may each receive video data from user A, and transmit their own video data to each of the other users that have agreed to partake in the video conference. For example, user B may receive the video data captured by one or more of users A, C and D.

Groups of people (i.e. users) may also be detected and identified at a single location or via single or multiple devices. This is important for improving group awareness and coverage from a single location into the virtual storied experience. This also ensures that all of the distributed people (users) and groups of people (users) are fully engaged and aware of everyone's presence.

Connected to the network 2 is a control server 102 arranged to receive video streams from one or more user terminals (e.g. user terminals 6 a, 6 b and 6 c) and to determine one or more other user terminals (e.g. user terminal 6 d) to transmit one or more of the received video streams to. The control server 102 may be implemented on a single computing device. The control server 102 may also operate to support performance of the relevant operations in a “cloud computing” environment whereby at least some of the operations may be performed by a plurality of computing devices.

User terminals 6 a, 6 b and 6 c may correspond to user terminal 6 d (which, in the following examples, is described as a “receiving terminal”). The user terminal 6 d executes, on a local processor, a communication client which corresponds to the communication client executed at the user terminals 6 a, 6 b and 6 c. The client at the user terminal 6 d performs the processing required to allow the user 4 d to communicate over the network 2 in the same way that the clients at user terminals 6 a, 6 b and 6 c perform the processing required to allow the users 4 a, 4 b and 4 c to communicate over the network 2. The user terminals 6 a, 6 b, 6 c and 6 d are end points in the communication system. FIG. 1 shows only four users (4 a, 4 b, 4 c and 4 d) and four user terminals (6 a, 6 b, 6 c and 6 d) for clarity, but many more users and user devices may be included in the communication system 100, and may communicate over the communication system 100 using respective communication clients executed on the respective user devices, as is known in the art.

FIG. 2 illustrates a detailed view of the user terminal 6 on which is executed a communication client for communicating over the communication system 100. The user terminal 6 comprises a central processing unit (“CPU”) 202, to which is connected a display 204 such as a screen or touch screen, input devices such as a keypad 206 and a camera 208. An output audio device 210 (e.g. a speaker) and an input audio device 212 (e.g. a microphone) are connected to the CPU 202. One or more additional sensors (not shown) such as a “Kinect” device or a mixed reality device such as “Hololens” may also be connected to the CPU 202. The display 204, keypad 206, camera 208, output audio device 210, input audio device 212 and additional sensors may be integrated into the user terminal 6 as shown in FIG. 2. In alternative user terminals one or more of the display 204, the keypad 206, the camera 208, the output audio device 210, the input audio device 212 and the additional sensors may not be integrated into the user terminal 6 and may be connected to the CPU 202 via respective interfaces. One example of such an interface is a USB interface. The CPU 202 is connected to a network interface 224 such as a modem for communication with the network 2. The network interface 224 may be integrated into the user terminal 6 as shown in FIG. 2. In alternative user terminals the network interface 224 is not integrated into the user terminal 6. The user terminal 6 also comprises a memory 226 for storing data as is known in the art. The memory 226 may be a permanent memory, such as ROM. The memory 226 may alternatively be a temporary memory, such as RAM.

FIG. 2 also illustrates an operating system (“OS”) 214 executed on the CPU 202. Running on top of the OS 214 is a software stack 216 for the communication client application referred to above. The software stack shows an I/O layer 218, a client engine layer 220 and a client user interface layer (“UI”) 222. Each layer is responsible for specific functions. Because each layer usually communicates with two other layers, they are regarded as being arranged in a stack as shown in FIG. 2. The operating system 214 manages the hardware resources of the computer and handles data being transmitted to and from the network 2 via the network interface 224. The I/O layer 218 comprises audio and/or video codecs which receive incoming encoded streams and decode them for output to speaker 210 and/or display 204 as appropriate, and which receive unencoded audio and/or video data from the microphone 212 and/or camera 208 and encode them for transmission as streams to other end-user terminals of the communication system 100. The client engine layer 220 handles the connection management functions of the VoIP system as discussed above, such as establishing calls or other connections by server-based or P2P address look-up and authentication. The client engine may also be responsible for other secondary functions not discussed herein. The client engine layer 220 also communicates with the client user interface layer 222. The client engine layer 220 may be arranged to control the client user interface layer 222 to present information to the user of the user terminal 200 via the user interface of the client which is displayed on the display 204, and to receive information from the user of the user terminal 200 via the user interface.

A display module 228 of the UI layer 222 is shown. The display module 228 determines the manner in which any video streams received over the network (via the network interface) are displayed at the display of the user terminal 6. For example, the display module may receive layout parameters from the network interface, and use these to generate, or select, a particular layout for displaying the one or more video streams.

The display module may also receive data relating to the video streams themselves, such as, for example, an associated priority value. The display module may use the priority value associated with a video stream to determine the duration for which that video stream shall be displayed at the user terminal 6 and/or where, within a predetermined layout, the video stream will be displayed.
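
By way of illustration only, the following Python sketch shows one way such a display module might map received layout parameters and per-stream priority values to on-screen positions and display durations. The parameter names (grid_id, slot, priority) and the duration table are assumptions made for the example, not part of the described system.

```python
# Hypothetical sketch of the display module's use of layout parameters
# and per-stream priority values (names and values are illustrative).

from dataclasses import dataclass

# Illustrative mapping from priority value to display duration in seconds
# (None is used here to mean "no fixed timeout").
PRIORITY_TO_DURATION = {1: 1.5, 2: 2.5, 3: None}

@dataclass
class LayoutParams:
    grid_id: str      # e.g. "902" for a single-unit grid, "904" for two units
    slot: int         # which unit of the grid the stream occupies

@dataclass
class StreamInfo:
    stream_id: str
    priority: int
    layout: LayoutParams

def plan_display(streams: list[StreamInfo]) -> list[dict]:
    """Return a simple render plan: where and for how long to show each stream."""
    plan = []
    # Higher-priority streams are placed first, so they take the more
    # prominent grid slots if two streams request the same one.
    for info in sorted(streams, key=lambda s: -s.priority):
        plan.append({
            "stream": info.stream_id,
            "grid": info.layout.grid_id,
            "slot": info.layout.slot,
            "duration_s": PRIORITY_TO_DURATION.get(info.priority),
        })
    return plan

if __name__ == "__main__":
    streams = [
        StreamInfo("s1", priority=3, layout=LayoutParams("904", 0)),
        StreamInfo("s2", priority=1, layout=LayoutParams("904", 1)),
    ]
    for entry in plan_display(streams):
        print(entry)
```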

FIG. 3 illustrates a more detailed view of the control server 102 shown in FIG. 1.

As can be seen in FIG. 3, the control server comprises a network interface 314 for receiving and transmitting video streams from and to other user terminals, over the communications network 2.

FIG. 3 corresponds to the control server of FIG. 1, where users A, B, C and D are participants of a video conference.

For the sake of conciseness, the control server shown in FIG. 3 is described from the perspective of determining which of the users A, B and C to display to a receiving user, User D. While stream s4 (the stream associated with user D) is not shown as an input to the control server, it will be appreciated that stream s4 may also be an input to the control server, and the control server may determine, for each individual user (i.e. users A, B, C and D), which of the other users (and their associated video streams) to display to that user.

In the example shown in FIG. 3, video streams s1, s2 and s3 are received at the network interface from user terminals 6 a, 6 b and 6 c respectively (each carrying a moving image of users A, B and C respectively). As a result of the operations performed by selector 312 (described later), streams s1 and s2 are selected and transmitted, via the network interface, to User D's user terminal 6 d, herein referred to more generally as the “receiving terminal”.

It should be noted that in alternative embodiments, two or more of the video streams may be received from a single camera device (i.e. there is not necessarily a one-to-one mapping between camera devices and video streams). In such a case, the video streams may be treated by the selector in the same way as if they had been received from separate devices.

In the embodiment described in relation to FIG. 3, at least two of the video streams are received at the control server from different instances of the communication client application, running on different user devices. That is, at least two of the video streams are received from different network endpoints having different network addresses (e.g. different IP addresses, or at least different transport addresses). For example, each of the video streams may be received from a different user terminal, where each of the different user terminals executes an instance of the communication client application (as is the case with streams s1, s2 and s3 shown in FIG. 3). For example, each instance of the communication client application may be different in the sense that a different user has logged into the communication client application to use it (e.g. using a unique username). In any case, the at least two of the video streams received at the control server are received from different instances of the communication client application, and not, for example, from different but co-located camera devices (i.e. all in a conference room), which may be connected to the network via a single instance of the communication client application.

In other cases, some of the streams may be received at the server from the same client. That is, a client may transmit more than one stream to the server, allowing the server to select between different streams from the same client in the same manner.

For example, a single camera may stream multiple streams derived from a locally-captured “master” video image. For example, each stream may carry a video image corresponding to a respective part of the master image (e.g. of different regions, different croppings etc.). As another example, multiple camera feeds may be streamed via the network from one location to a shared virtual stage experience.

The network interface 314 is connected to a feature detection module 308, which may for example comprise a skeletal tracking module and/or a facial detection module (not shown separately). The skeletal tracking module is configured to identify the skeletons of one or more users in one or more of the video streams received at the network interface. The skeletal tracking module may use the same process for identifying skeletons as Microsoft's Kinect sensor. The facial detection module is configured to detect the face(s) of any users in each video stream. In the example shown in FIG. 3, the feature detection module 308 receives video streams s1, s2 and s3, and determines whether any users (or rather, skeletons) are present in the respective video streams.

Having identified that one of the video streams is carrying an image of a user, the skeletal tracking module of the feature detection module 308 may forward information about the detected user in the corresponding video stream to a feature tracking module 310. This information may comprise an indication of where the “skeleton” of the user was identified within the moving image, for example corresponding to predetermined points on the user's body, e.g. corresponding to known skeletal points. Either way, this allows the feature tracking module 310 to identify particular human features within the moving image. For example, the identified “skeleton” of the user may provide a reference from which the feature tracking module can identify and track the movement of one or more human features. Alternatively or additionally, the facial detection module may provide information about the detected face(s) to the feature tracking module 310, allowing the latter to track the corresponding facial movements.

Human features may include, for example, the arms, hands, and/or face of a user. Human features may also include more specific human features such as the eyes, mouth and nose of a user. By tracking the movement of these features over time, the feature tracking module 310 is able to detect and distinguish between different types of reaction that an identified user is having. For example, the feature tracking module 310 may be able to identify user reactions such as: smiling, laughing, frowning, gasping, head nodding, head shaking, hand waving, hand pointing, clapping, giving a thumbs up, raising or lowering their arms, celebrating with e.g. clenched fists, etc.

The feature tracking module 310 may identify a user's reaction by comparing the movement of one or more identified human features with the entries of a database 304 storing predetermined, i.e. expected, movements of the corresponding human features. The database of expected movements 304 may be stored in memory 302 at the control server.

For example, each expected movement may be defined by a set of parameters describing the movement of one or more human features. The feature tracking module 310 may determine the parameters describing the movement of one or more human features of an identified user and compare these to the parameters describing known, i.e. expected, movements to determine whether the user has performed an expected movement.

If the feature tracking module 310 determines that the identified user's movement of one or more human features corresponds to one of the expected movements in the database 304, the feature tracking module 310 may provide an indication to a selector 312 that the expected movement has been detected.
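
Purely as an illustration of this matching step, the sketch below compares a tracked feature trajectory against parameterized expected movements and returns the matched movement, which could then be reported to a selector. The parameterization (per-feature displacement ranges) and the function names are assumptions made for the example, not the disclosed implementation.

```python
# Hypothetical sketch of matching tracked feature movement against a
# database of expected movements (parameterization is illustrative only).

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExpectedMovement:
    name: str
    # For each named feature, the (min, max) displacement range, in
    # normalized image units, that counts as this movement.
    feature_ranges: dict[str, tuple[float, float]]

EXPECTED_MOVEMENTS = [
    ExpectedMovement("smile", {"mouth_corner_left": (0.01, 0.05),
                               "mouth_corner_right": (0.01, 0.05)}),
    ExpectedMovement("hand_wave", {"right_hand": (0.10, 0.60)}),
]

def observed_displacements(trajectory: dict[str, list[tuple[float, float]]]) -> dict[str, float]:
    """Reduce a per-feature list of (x, y) positions to a total displacement."""
    out = {}
    for feature, points in trajectory.items():
        (x0, y0), (x1, y1) = points[0], points[-1]
        out[feature] = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return out

def match_expected_movement(trajectory) -> Optional[str]:
    """Return the name of the first expected movement whose ranges all hold."""
    disp = observed_displacements(trajectory)
    for movement in EXPECTED_MOVEMENTS:
        if all(lo <= disp.get(f, 0.0) <= hi
               for f, (lo, hi) in movement.feature_ranges.items()):
            return movement.name
    return None

if __name__ == "__main__":
    trajectory = {"right_hand": [(0.2, 0.5), (0.5, 0.5)]}
    print(match_expected_movement(trajectory))  # -> "hand_wave"
```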

Selector 312 is configured to receive each of the plurality of video streams received at the network interface 314, and to determine which of these to cause to be displayed at one or more user terminals. In the example of FIG. 3, selector 312 is configured to determine which of the video streams s1, s2 and s3 to cause to be displayed at User D's user terminal (i.e. the receiving terminal).

Selector 312 is also configured to receive an indication from feature tracking module 310 of any expected movements, i.e. reactions, that have been detected in any of the video streams received at the selector 312. This indication is herein referred to as the “reaction indicator”.

The reaction indicator may inform the selector 312 of any “reactions” (i.e. expected movements) that were identified in one or more of the video streams received at the selector 312. This enables the selector 312 to determine which of the plurality of received video streams to select for display at a particular user's user terminal (in this example, user D's user terminal 6 d).

The reaction indicator also enables the selector 312 to determine a time interval for which the video stream associated with that reaction should be displayed at a particular receiving terminal (again, in this example, user D's user terminal 6 d). For example, the selector 312 may use the reaction indicator to query the entries of a database storing a list of pre-determined reactions and the time intervals for which those reactions should be displayed at a receiving terminal. The entries of such a database are shown in FIG. 4 (discussed later).

The selector 312 may, for example, use the time interval associated with an identified reaction to determine the duration for which a selected video stream should be transmitted to a particular receiving terminal (e.g. user D's user terminal 6 d).

The selector 312 may also use the reaction indicator to determine a priority associated with the identified reaction. For example, certain reactions may be deemed more worthy of display than others, and this may be indicated in the associated priority value (i.e. the higher the priority value, the more likely it is that the associated video stream is selected for display).

In a situation where reactions are detected in multiple video streams, but only a limited number of video streams can be displayed at a particular receiving terminal, the selector 312 may use the priority value associated with each of the detected reactions to determine which of the associated video streams to select for display at the receiving terminal.

The priority value may also determine the manner in which a selected video stream is displayed relative to any other video streams that are also selected for display at the receiving terminal (i.e. relative position and size).
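
As a minimal sketch of this selection step, and assuming an illustrative reaction-to-priority table and a fixed slot limit (neither taken from the disclosure), the selector's priority-based choice might look like the following.

```python
# Hypothetical sketch of the selector choosing which reacting streams to
# display when more reactions are detected than there are grid slots.

REACTION_PRIORITY = {"smile": 1, "head_nod": 2, "hand_wave": 3}

def select_streams(reaction_indicators: dict[str, str], max_slots: int) -> list[str]:
    """reaction_indicators maps stream id -> detected reaction name.

    Returns the stream ids to display, most prominent first.
    """
    ranked = sorted(reaction_indicators.items(),
                    key=lambda item: REACTION_PRIORITY.get(item[1], 0),
                    reverse=True)
    return [stream_id for stream_id, _ in ranked[:max_slots]]

if __name__ == "__main__":
    indicators = {"s1": "smile", "s2": "hand_wave", "s3": "head_nod"}
    # Only two slots are available on the receiving terminal in this example.
    print(select_streams(indicators, max_slots=2))  # -> ['s2', 's3']
```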

Having determined which of the plurality of video streams to display at the receiving terminal (e.g. to user D), the selector 312 may also select a particular layout for displaying the one or more selected video streams (streams s1 and s2 in FIG. 3).

The selector 312 may have stored in memory a selection of grid layouts, and the selector 312 may select a particular grid layout for displaying the one or more selected video streams. The grid selected by the selector 312 may depend on the number of video streams that the selector 312 has selected for display at a particular user terminal. The moving images of the selected video streams may need to be cropped so as to be displayed at a particular location in the grid. For example, the one or more moving images may be cropped so as to display the most important information. The moving images may be cropped according to a tight, medium or wide view, depending on the detected expected movement and the selected grid layout.

The selector 312 may also use the priority associated with the reaction identified in a selected video stream (based e.g. on whether a reacting or speaking user was detected) to determine where, within the selected grid layout, that video stream is to be displayed. Some examples of possible grid layouts are shown in FIG. 9. For example, grid layout 902 may be used to display a single video stream, grid layout 904 may be used to display two video streams simultaneously, grid layout 906 may be used to display three video streams simultaneously, and grid layout 908 to display four streams simultaneously. Whilst only four grid layouts are shown in FIG. 9, it will be appreciated that a grid layout may be selected so as to display any number of video streams. For example, a grid layout comprising five or more units may be selected to display four selected video streams. Whilst the grid layouts shown in FIG. 9 are all shown with rectangular units, the units of each grid may be of any shape and are not constrained so as to all be of the same shape.
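
The sketch below illustrates one possible way of choosing a grid layout by the number of selected streams and assigning grid slots by priority. The layout identifiers loosely mirror the FIG. 9 numbering for readability; the slot-assignment rule (highest priority takes the first slot) is an assumption for the example.

```python
# Hypothetical sketch of grid layout selection and slot assignment.

LAYOUT_BY_STREAM_COUNT = {1: "902", 2: "904", 3: "906", 4: "908"}

def choose_layout(selected: list[str], priorities: dict[str, int]) -> dict:
    """Return a layout id plus a slot assignment for each selected stream."""
    layout_id = LAYOUT_BY_STREAM_COUNT.get(len(selected))
    if layout_id is None:
        raise ValueError("no predefined layout for this many streams")
    # The highest-priority stream occupies slot 0 (the most prominent unit).
    ordered = sorted(selected, key=lambda s: priorities.get(s, 0), reverse=True)
    return {"layout": layout_id,
            "slots": {stream: slot for slot, stream in enumerate(ordered)}}

if __name__ == "__main__":
    print(choose_layout(["s1", "s2"], {"s1": 2, "s2": 5}))
    # -> {'layout': '904', 'slots': {'s2': 0, 's1': 1}}
```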

For greater story continuity and fluid transitions between the different video streams that are displayed at the receiving terminal, the selector may be configured to ensure that there is a limited duration of time in which the units of the selected grid layout can be updated (i.e. a new video stream can be selected for display at that unit of the grid).

For example, in one embodiment, the selector may ensure that only one unit of the selected grid is changed at a time—i.e. no new video streams are displayed at any of the other units of the grid during the second time interval.

Alternatively, in a second embodiment, the selector may ensure that there is a limited duration of time in which multiple units of the selected grid layout can be updated (i.e. to display the video streams in which a change in the number of users was detected). For example, following e.g. the selection of a first video stream, the selector may only allow other units of the selected grid to be updated if these can be updated before the limited duration of time elapses.

These embodiments ensure that the least amount of video grid view updates occur within a designated duration of time, thereby making it as easy as possible for users to follow user activity within the Storied Experience View. The selected grid and positioning of each of the selected video streams within the grid may be indicated to the receiving user terminal (e.g. terminal 6 d) in the form of layout parameters, as shown in FIG. 3. The receiving user terminal may interpret the layout parameters so as to display each of the selected video streams at their respective positions in the selected grid.
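
As a sketch of the second embodiment's batching window, the following gate allows further grid-unit updates only for a short period after the first update; the window length is an illustrative value and the class name is invented for the example.

```python
# Hypothetical sketch of rate-limiting grid updates so that view changes
# are batched into a short window rather than trickling in continuously.

import time
from typing import Optional

class GridUpdateGate:
    """Allows grid-unit updates only within a short window after the first one."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self._window_start: Optional[float] = None

    def allow_update(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self._window_start is None:
            # The first update opens the window.
            self._window_start = now
            return True
        # Further updates are allowed only before the window elapses.
        return (now - self._window_start) <= self.window_s

    def reset(self):
        self._window_start = None

if __name__ == "__main__":
    gate = GridUpdateGate(window_s=1.0)
    print(gate.allow_update(now=10.0))   # True  (opens the window)
    print(gate.allow_update(now=10.5))   # True  (within the window)
    print(gate.allow_update(now=12.0))   # False (window has elapsed)
```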

For example, referring to FIG. 3, the selector 312 may receive an indication that reactions were detected in streams s1 and s2 and, based on this, select streams s1 and s2 for transmission to user 4 d's user terminal. The selector 312 may select, for example, grid layout 904, shown in FIG. 9, and forward the corresponding layout parameters to the receiving terminal. In response to receiving the layout parameters, the receiving terminal may then render the two video streams such that the first video stream, s1, is displayed at a first location of the grid (e.g. the left-hand unit of the grid), and the second video stream, s2, is displayed at a second location of the grid (e.g. the right-hand unit of the grid). In some embodiments, it may not be necessary to send all of the layout parameters to the receiving terminal if, for example, there is no change in the number of video streams that are to be displayed at the receiving terminal (as described later in relation to FIGS. 7A and 7B).

Alternatively, the reaction indicator may indicate that a reaction was detected in stream s2 only. Based on this, the selector 312 may determine to increase the number of video streams displayed at user 4 d's user terminal 6 d, by continuing to transmit stream s1 (which was displayed at user 4 d's user terminal 6 d prior to detecting a reaction from user 4 b) and also transmitting stream s2 to User D. User D is thus able to view the reaction of User B, in addition to the video of User A. In this particular example, user A may be, for example, a speaking user, while user B is a reacting user, reacting to user A's speech. The control server may transmit layout parameters for grid layout 904, instead of the layout parameters for 902, which were previously used to display User A's video stream (as described later in relation to FIGS. 6A and 6C).

Continuity is important for the storied experience; if an event is tagged as relating to a certain location, it may replace that location's current video stream location in the grid for the new duration timed event (i.e. the second time interval), whereas a newly promoted event may occupy an added grid location or grid escalation.

Stylized grid, duration and location playback may have unique rules for unique circumstances. For example, an end-of-meeting “montage” could display a series of related and unrelated events next to each other in the grid as a stylized reprise of the meeting event. For example, the duration timer for each event could be aligned or intentionally rhythmic to an audio track.

FIG. 4 shows a high-level representation of a database that may be used by the control server to determine a priority associated with a reaction identified in one or more of the received video streams. As can be seen in FIG. 4, a first column 402 of the database may contain entries for each expected “movement” (i.e. reaction). For example, M1 may correspond to “smiling”, M2 may correspond to “head nodding”, M3 may correspond to “head shaking”, and so on and so forth.

A second column 404 of the database may contain entries for the priorities associated with each expected movement. For example, movement M1 (e.g. smiling) may have a priority value P1, which is higher or lower in value than the priority value P2 associated with movement M2 (e.g. head nodding). The priority values of each respective movement may be used to determine the manner in which video streams are displayed relative to one another. For example, a video stream featuring a higher-priority reaction may be displayed more prominently than a video stream featuring a lower-priority reaction. The priority values may be used, for example, to determine which of the units of a grid layout (such as those shown in FIG. 9) a selected video stream occupies.

The priorities may also be used to limit the number of video streams that are selected for display at a receiving terminal—for example, if reactions are detected in multiple video streams but only a limited number of video streams can be displayed (effectively) at a particular receiving terminal, the priority values may be used to determine which of those video streams are selected for display.

In certain embodiments, there may be a limit on the number of detection types that can occur within a certain duration of time, i.e. to control the amount of coverage that is displayed to a user within a specific duration of time. Over-coverage of user activity may become disorienting to the user viewing it; it is therefore important to strike the balance between expanding the storied awareness of user activity whilst guarding against over-coverage.

It will be appreciated that, whilst an individual priority value is shown for each expected movement, several movements may share the same priority value and be grouped according to this priority value. For example, rather than having a priority value associated with each movement, movements may be grouped according to e.g. the type of movement, and movements of the same “type” may share the same priority value. The “type” of a particular movement may determine its corresponding priority value.

A third column 406 of the database may contain entries for the time interval associated with each movement, that is, the time interval for which the video stream associated with that movement should be displayed at a receiving terminal. Different expected movements may be associated with different time intervals depending on the nature of the movement. For example, a movement that involves the movement of the whole of a user's body may have a time interval that is longer than a movement that corresponds to e.g. “smiling”. The control server may use the time interval to determine when to stop transmitting the video stream associated with a particular movement to a particular receiving terminal. Four types of time interval (i.e. durations) are described below.
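
Purely by way of illustration, the FIG. 4 table could be represented as a simple in-memory structure like the one below. The movement names, priorities, time intervals and grouping values are placeholders, not values taken from the disclosure.

```python
# Hypothetical sketch of the expected-movement database of FIG. 4 as a
# simple in-memory table (all values are illustrative placeholders).

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExpectedMovementEntry:
    movement: str               # column 402: the expected movement / reaction
    priority: int               # column 404: display priority
    interval_s: Optional[float] # column 406: display time interval (None = persistent)
    group: str                  # further column 408: movement "type" grouping

EXPECTED_MOVEMENT_DB = {
    "M1": ExpectedMovementEntry("smiling", priority=1, interval_s=1.5, group="facial"),
    "M2": ExpectedMovementEntry("head nodding", priority=2, interval_s=1.5, group="head"),
    "M3": ExpectedMovementEntry("head shaking", priority=2, interval_s=1.5, group="head"),
}

def lookup(movement_id: str) -> ExpectedMovementEntry:
    """Query the table for a detected movement, as the selector might."""
    return EXPECTED_MOVEMENT_DB[movement_id]

if __name__ == "__main__":
    entry = lookup("M1")
    print(entry.movement, entry.priority, entry.interval_s)
```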

Short Duration:

A set duration attached to a non-verbal event. Short would be set to a specific duration (example: 1.5 seconds), allowing the priority-assigned video to be priority-stacked in the video story for consuming participants but not interrupt the active speaking audio signal. The active speaking audio signal would remain constant. Short will be assigned to participant activity that adds awareness but is not essential as an extended activity, including reaction shots (smiling, head nodding, head shaking, hand waving, hand pointing, etc.).

Medium Duration:

A set duration attached to a non-verbal event. Medium would be set to a specific duration (example: 2.5 seconds), allowing the priority-assigned video to be priority-stacked in the video story for consuming participants but not interrupt the active speaking audio signal. The active speaking audio signal would remain constant. Medium is assigned to specific activities deemed important to group activity awareness, such as a change of body location in the room, or a detection of a new body or person in the room (stand, sit, walk, enter, leave a location).

Extended Duration:

The set duration for dominant activity participants. This duration is primarily assigned to the active speaker, giving the active speaker the dominant story priority unless interrupted by a short-duration story view or deprioritized due to lack of speaking. An example of this is where the story view is in single-grid view and is an edge-to-edge video of the active speaker. When a short or medium duration priority video is triggered to replace the active speaker video (but not the active audio), once the limited-duration video has timed out it is replaced by the continuous active speaker video view that was populated at this location previously.

Persistent Duration:

The set duration for dominant activity participants. This duration is primarily assigned to a user-pinned view or a view type that does not allow a video view to be interrupted. This duration is continuous until the user re-assigns the view or the meeting ends. This may take precedence over all other duration/non-verbal coverage types that would normally be assigned priority against this active view.

A non-verbal communication duration priority metric is used for body, arm, hand, gesture, head, face and eye movement detection, as part of the story video priority metric. The duration priority metric works in conjunction with a playback durations library: the short, medium, extended and persistent specifications described above, as well as story grid location priority designated by a stack ranking of most recent activity, participant association or user preference.
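
As a minimal sketch of mapping a detected event type to one of the duration classes described above, the following grouping of detection types and the example second values are assumptions for illustration only.

```python
# Hypothetical sketch of assigning a playback duration class to a detected
# event, following the short/medium/extended/persistent descriptions above.

from enum import Enum, auto

class Duration(Enum):
    SHORT = auto()
    MEDIUM = auto()
    EXTENDED = auto()
    PERSISTENT = auto()

# Illustrative playback durations in seconds (None = no fixed timeout).
DURATION_SECONDS = {
    Duration.SHORT: 1.5,        # reaction shots: smiling, nodding, waving, pointing
    Duration.MEDIUM: 2.5,       # body location changes, a person entering/leaving
    Duration.EXTENDED: None,    # active speaker; runs until displaced or demoted
    Duration.PERSISTENT: None,  # user-pinned view; runs until re-assigned or call ends
}

SHORT_EVENTS = {"smiling", "head_nodding", "head_shaking", "hand_waving", "hand_pointing"}
MEDIUM_EVENTS = {"stand", "sit", "walk", "enter", "leave"}

def assign_duration(event_type: str, pinned: bool = False) -> Duration:
    """Return the duration class for a detected (non-verbal or speaking) event."""
    if pinned:
        return Duration.PERSISTENT
    if event_type in SHORT_EVENTS:
        return Duration.SHORT
    if event_type in MEDIUM_EVENTS:
        return Duration.MEDIUM
    # Default: treat as dominant activity (e.g. the active speaker).
    return Duration.EXTENDED

if __name__ == "__main__":
    print(assign_duration("smiling"))         # Duration.SHORT
    print(assign_duration("active_speaker"))  # Duration.EXTENDED
```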

For greater story continuity and fluid people engagement experiences, the camera view grid updates should also be populated as a single update or as a group within a set duration of time, whether from a single location or multiple locations. This is to ensure that the video playback is as fluid and noise-free as possible. It is also to ensure that the least amount of video grid view updates occur within a designated duration of time, thereby allowing the story experience to be as engaging and easy to follow as possible. It should also be noted that, for the durations described above, user or participant tagging may also influence the system priority stack. For example, a user may tag sensor data (video views) and a priority may be placed on those views for real-time story playback, recording, or editing after the event.

It will be appreciated that, whilst a separate column is shown in FIG. 4 for the priority and time interval of each respective movement, these two parameters may in fact be correlated (i.e. derivable from one another).

For example, the priority value of an expected movement may also determine the time interval for which it is displayed. For example, an expected movement with a higher priority value may be displayed for longer than an expected movement with a lower priority value. Alternatively, an expected movement with a lower priority value may be displayed for a longer time interval.

Ultimately, any relationship between the priority value and time interval may be used. This relationship may allow time intervals to be determined ‘on the fly’ for each identified expected movement. That is, rather than storing a time interval for each of the possible expected movements in a database, the database may contain entries for the priority values only, and use these to determine the time interval associated with a particular movement, as and when that movement is identified within a particular video stream.
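
As a trivial illustration of deriving the interval on the fly from the stored priority alone, one could use a simple monotonic mapping; the linear relationship and constants below are assumptions for the example only.

```python
# Hypothetical on-the-fly derivation of a display interval from a stored
# priority value; the linear relationship and constants are illustrative.

def interval_from_priority(priority: int,
                           base_s: float = 1.0,
                           per_level_s: float = 0.5) -> float:
    """Higher-priority movements are shown for longer in this example mapping."""
    return base_s + per_level_s * priority

if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, interval_from_priority(p))  # 1 -> 1.5, 2 -> 2.0, 3 -> 2.5
```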

One or more other columns 408 of the database may contain entries pertaining to other parameters. For example, these parameters may relate to the grouping of different types of reactions, e.g. reactions involving hand movements may belong to a particular group, whilst reactions involving changes in a user's facial expression may belong to a different group. Each expected movement may be associated with a group value, and expected movements sharing the same group value may be deemed to be of the same “type” (which may indicate that they share the same priority values and/or time intervals).

Additionally, the database may include a column for the parameters defining each expected movement. These parameters may define, for each expected movement, the corresponding changes in the relative positioning of a user's eyes, nose, mouth, eyebrows, hands etc. These parameters may also be associated with a margin of error—i.e. a range in which the relative positioning of a user's eyes, nose, mouth, eyebrows, hands etc. may change and still be identified as corresponding to the respective expected movement.

FIG. 5 illustrates a flowchart of the method performed at the control server for determining when to select a video stream for display at a receiving terminal (e.g. User D's user terminal).

It should be noted that, whilst FIG. 5 only shows a method for determining whether to select a single video stream for display at a receiving terminal, the control server may perform multiple instances of the described method, e.g. in parallel, in order to determine whether a plurality of video streams should be selected for display at a receiving terminal.

At step S502, a plurality of video streams are received at the control server (i.e. at the network interface of the control server). For example, these video streams may be received from the user terminals associated with users A, B and C. Alternatively, two or more of these video streams may be received from a single camera device associated with two or more of users A, B and C.

At step S504, the control server selects a subset of the received video streams for display at the receiving terminal. The control server causes these video streams to be displayed at the receiving terminal, i.e. by transmitting them, along with any associated layout parameters, to the receiving terminal.

At step S506, the control server identifies a video stream that is not currently being displayed at the receiving terminal (herein referred to as “the identified video stream”). For example, each of the video streams received at the control server may include an indication of whether or not they are currently being displayed at the receiving terminal. The control server may use these indications to identify a video stream that is not currently being displayed at the receiving terminal.

Alternatively, a separate module within the control server (not shown in FIG. 3) may keep track of the video streams that were previously selected for display at the receiving terminal. This information may be used by the control server to identify a video stream that is not currently being displayed at the receiving terminal.

At step S508, the control server identifies one or more human features of the user identified within the identified video stream. As noted earlier in relation to FIG. 3, the feature detection module 308 may identify that a user is present in the identified video stream (e.g. based on skeletal and/or facial tracking) and a feature tracking module 310 may use this information to identify one or more human features of the identified user.

At step S510, the control server tracks the movement of the one or more identified human features. This may involve, for example, tracking the movement of a user's eyes and mouth, to determine whether the user is smiling or frowning etc.

At step S512, the control server identifies that the movement of the one or more human features corresponds to an expected movement, i.e. a known “reaction”. As noted earlier in relation to FIG. 3, this may involve determining parameters for the identified movement and comparing these with the parameters defining expected movements.

At step S514, the control server determines whether to cause the identified video stream to be displayed at the receiving terminal. If the control server determines that the identified video stream should not be displayed at the receiving terminal (indicated by ‘NO’ in FIG. 5), the control server continues to track the one or more identified human features of the user identified in the identified stream.

The control server may, for example, determine a priority value associated with the identified movement, and determine whether this value is higher than a priority value determined for a second video stream in which an expected movement was also identified. If, for example, the priority value of the expected movement in the identified stream is lower than that of the expected movement detected in the second video stream, the control server may determine that the identified video stream should not be displayed at the receiving terminal. If, whilst displaying the video stream in which an expected movement was detected, the audio of a new speaking user is detected, the control server may ensure that, once the second time interval has elapsed, the video stream associated with the new speaking user is selected for display (and caused to be displayed) at the receiving terminal.

If the control server determines that the identified video stream should be displayed at the receiving terminal (indicated by ‘YES’ in FIG. 5), the control server selects the video stream for display at the receiving terminal.

At step S516, the control server determines the time interval for which the selected video stream should be displayed and any layout parameters that are needed in order to define the way in which the selected video stream will be displayed at the receiving terminal (e.g. relative to any other video streams that have been selected for display at the receiving terminal).

In one embodiment, the time interval associated with the selected video stream may be derived, for example, from the priority associated with the identified “expected movement”. As noted earlier in relation to FIG. 3, each of the “expected movements” may be associated with a priority, and the priority may determine where, and for how long, the selected video stream is displayed at the receiving terminal.

At step S518, the control server transmits the selected video stream to the receiving terminal, along with any associated layout parameters. As noted earlier, the layout parameters are used by the receiving terminal to determine the manner in which the selected video stream is to be displayed.

At step S520, the control server detects that the time interval associated with the selected video stream has elapsed and stops sending the selected video stream. In response to the time interval elapsing, the control server may transmit new layout data to the receiving terminal, thereby ensuring that screen space is not allocated to video streams that are no longer being transmitted to the receiving terminal from the control server.
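
To tie the steps of FIG. 5 together, the following end-to-end sketch mimics the loop from receiving streams through promoting a reacting stream and later reclaiming its slot. The stream objects, detection helpers and timing values are placeholders standing in for the modules described above, not the disclosed implementation.

```python
# Hypothetical end-to-end sketch of the control-server loop of FIG. 5
# (steps S502-S520). All names and values are illustrative.

import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StreamState:
    stream_id: str
    displayed: bool = False
    promoted_until: Optional[float] = None  # end of the "second time interval"

@dataclass
class ControlServerSketch:
    streams: dict[str, StreamState] = field(default_factory=dict)

    def receive(self, stream_id: str):                        # S502
        self.streams[stream_id] = StreamState(stream_id)

    def select_initial(self, to_display: list[str]):          # S504
        for sid in to_display:
            self.streams[sid].displayed = True

    def on_expected_movement(self, stream_id: str, interval_s: float):
        """S506-S516: a reaction was matched in a non-displayed stream."""
        state = self.streams[stream_id]
        if not state.displayed:
            state.displayed = True                            # S518 (start sending)
            state.promoted_until = time.monotonic() + interval_s
            # Layout parameters would also be sent to the receiving terminal here.

    def tick(self):
        """S520: stop sending promoted streams whose interval has elapsed."""
        now = time.monotonic()
        for state in self.streams.values():
            if state.promoted_until is not None and now >= state.promoted_until:
                state.displayed = False
                state.promoted_until = None
                # New layout data would be sent so screen space is reclaimed.

if __name__ == "__main__":
    server = ControlServerSketch()
    for sid in ("s1", "s2", "s3"):
        server.receive(sid)
    server.select_initial(["s1"])            # speaking user shown first
    server.on_expected_movement("s2", 1.5)   # a smile detected in s2
    server.tick()
    print({s.stream_id: s.displayed for s in server.streams.values()})
```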

FIG. 6A illustrates an example of a moving image of a user, user 604A, that may be displayed at the display of User D's user terminal for a first time interval during the video conference.

User 604A may be a user that has been determined to be important, based e.g. on a recent detection of the user's speech, or the user having initiated the video conference. This user is herein referred to as the “primary user”, with an associated “primary video stream”.

During the first time interval, the control server may determine that a second user has reacted to the actions performed by primary user 604A. For example, the control server may identify that a second user, herein referred to as the “reacting user” 604B, has smiled during the first time interval. In response to detecting the reacting user's smile, the control server may select the video stream associated with the reacting user for display at the receiving user's user terminal. This video stream is herein referred to as the “reacting user's video stream”.

An example embodiment is illustrated in FIG. 6B, where the moving image of primary user 604A has been replaced with the moving image of reacting user 604B. As noted earlier, the moving image of the reacting user is displayed for a predetermined time interval (the second time interval). The control server may ensure that the video of the primary user is not transmitted to User D for the duration of this time interval.

The control server may also ensure that any audio (i.e. detected speech) associated with the primary user is still transmitted to User D. That is, the control server may treat the video and audio streams of each user (e.g. users A, B and C) separately, and only determine which of the video streams (and not audio streams) to select for display at User D's user terminal. Hence, User D is able to continue to listen to the speech of the primary user, whilst also viewing the reactions of other users, as and when they occur.
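
The separate treatment of audio and video could be pictured as follows; this is a sketch only, and the routing calls and participant fields are hypothetical.

    # Hypothetical illustration: audio from every participant is always forwarded,
    # while video is forwarded only for the streams selected for display.
    def route_streams(server, terminal, participants, selected_video_ids):
        for p in participants:
            server.forward_audio(terminal, p.audio_stream)    # speech is always heard
            if p.user_id in selected_video_ids:               # video only if selected
                server.forward_video(terminal, p.video_stream)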

When a single grid video view is streaming from a location and a new video priority type is detected, a duration type is assigned to that video depending on the detection type, and it replaces the lower priority video stream. In most cases, non-verbal communication is a video duration priority only. The audio priority stack performs separately.
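
As a rough illustration of this single-view behaviour, under the assumption that detection names, priorities and durations take the illustrative values shown:

    # Hypothetical single-view behaviour: a newly detected, higher-priority video
    # replaces the currently shown video for a duration set by the detection type.
    DURATION_BY_DETECTION = {"smile": 5, "clap": 7, "hand_raise": 10}  # seconds, illustrative

    def maybe_replace(current: dict, candidate: dict) -> dict:
        if candidate["priority"] > current["priority"]:
            candidate["duration_s"] = DURATION_BY_DETECTION.get(candidate["detection"], 5)
            return candidate   # the candidate takes over the single view
        return current         # otherwise keep showing the current video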

In an alternative embodiment, in response to determining that a second user has reacted to the actions performed by primary user 604A, the control server may continue to transmit the primary video stream to User D's user terminal, and also select the reacting user's video stream for transmission to (and subsequent display at) User D's user terminal. This may also include transmitting new layout parameters to User D's user terminal 6d, i.e. layout parameters that ensure that the two video streams are displayed using grid layout 904 (FIG. 9).

This is shown in FIG. 6C, where the video streams of both the primary and reacting users are shown simultaneously, adjacent to one another, at the display of User D's user terminal. In this particular embodiment, User D is able to view both the primary user (who may be, for example, a speaking user) as well as the reaction of user 604B (who may be reacting to what the primary user is saying).

FIG. 7A shows an alternative embodiment in which two primary users may be displayed at the display of the receiving terminal, during a first time interval, during the video conference. This may occur, for example, where both of the primary users are determined as being of equal importance (for example, where audio data has been recently detected for both users). Alternatively, this may be a default setting for a receiving user that is in a video conference with two other users (as shown in FIG. 1).

Again, during the first time interval, the control server may identify that a third user, the reacting user, has smiled during the first time interval. In response to detecting the reacting user's smile, the control server may select the reacting user's video stream for display at the receiving terminal.

In this particular embodiment, the control server may cause one of the video streams displaying a second primary user, user 704A, to be replaced with the video stream associated with the reacting user, 704B. The control server may determine a relative priority of each of the video streams associated with the primary users (e.g. based on which of the two primary users spoke most recently), and based on this, select the video stream with the highest priority for display at the receiving terminal.

The control server may then continue to transmit the video stream associated with the highest priority to the receiving terminal, and also transmit the reacting user's video stream to the receiving terminal. This may involve sending new layout data to the receiving terminal, such that, in response to receiving the new layout data, the receiving terminal displays the video stream of a first primary user, user 604A, and the reacting user, 704B, in a particular arrangement at the receiving terminal.

Such an arrangement is illustrated in FIG. 7B, where the moving image of primary user 704A has been replaced with the moving image of reacting user 704B. Again, the moving image of the reacting user is displayed for a predetermined time interval (the second time interval), which may be independent of the time interval for which the primary user, user 604A, is displayed at the receiving terminal.

FIG. 8 shows an alternative embodiment in which three primary users are displayed at the display of the receiving user's user terminal. In this embodiment, the video stream of a third primary user 804A is replaced with the video stream of a reacting user 804B. As in FIGS. 7A and 7B, each of the video streams may be associated with a priority, and the video stream with the lowest priority may be replaced with the video stream associated with the reacting user. Additionally, the reacting user's video stream may occupy a larger segment of the receiving terminal's display, depending on the priority associated with the identified reaction.
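
One way the lowest-priority replacement might be expressed is sketched below; the stream records and field names are hypothetical.

    # Hypothetical multi-view behaviour: the reacting user's stream replaces the
    # lowest-priority primary stream, provided the reaction outranks that stream.
    def replace_lowest(primary_streams: list, reacting_stream: dict) -> list:
        lowest = min(primary_streams, key=lambda s: s["priority"])
        if reacting_stream["priority"] > lowest["priority"]:
            primary_streams[primary_streams.index(lowest)] = reacting_stream
        return primary_streams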

In the example of FIG. 8, the detected smile of the reacting user is of a high enough priority to replace the video of e.g. an inactive user, but not of a high enough priority to replace the video of a speaking user, such as user 804A.

It will be appreciated that while FIG. 8 is described in the context of replacing one of three primary video streams with a reacting user's video stream, any number of the three primary video streams may be replaced with the video streams of reacting users (depending on the number of participants in the video conference, the number of reacting users, etc.).

It will also be appreciated that, if the control server causes an increase in the number of video streams that are displayed at a receiving terminal, then any number of reacting users may be displayed in addition to the one or more primary users.

For example, if a primary user is displayed in a first window 602A, and a reaction is identified in the video streams of two other users, the first window 602A may be updated so as to display the video stream of the primary speaker, as well as the video streams of the two other reacting users. This may involve transmitting new layout parameters from the control server to the receiving terminal, e.g. layout parameters that enable the video streams to be displayed using grid layout 906 (FIG. 9) instead of grid layout 902 (FIG. 9).
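
A helper for choosing between the grid layouts of FIG. 9 might look like the sketch below, assuming (as the description suggests) that layout 902 holds one stream, layout 904 two streams and layout 906 three streams; the function itself is illustrative.

    # Hypothetical helper: pick a grid layout identifier from the number of
    # streams to be shown (mapping assumed from the description of FIG. 9).
    def select_layout(num_streams: int) -> str:
        if num_streams <= 1:
            return "902"   # single-stream layout
        if num_streams == 2:
            return "904"   # two-stream grid
        return "906"       # three-stream grid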

For example, window 602A may be replaced with a window akin to the display window 802B shown in FIG. 8B, but with a reacting user displayed in each of the two smaller segments of the display window. The two reacting users' video streams may be displayed for the same or different time intervals, depending on the reaction identified in each of the video streams (e.g. whether they both belong to a group of reactions that share the same or similar time intervals).

In an alternative embodiment, the control server may increase the number of video streams that are displayed at a receiving terminal, such that a display window showing two primary users (e.g. the display window shown in FIG. 7A) is updated so as to also display the video stream of a reacting user (e.g. the display window shown in FIG. 8B), in addition to the video streams of the two primary users.

When a multi-grid video story view is streaming for group activity and a new video priority type is detected, a duration type is assigned to that new video depending on the detection type. The duration type determines how long the singular priority will last until the priority is reset to the current detected participant activity or is overruled by a higher priority video. In a multi-grid scenario, the least active video is replaced by the new duration type priority video, unless it is tagged as related to a specific location or participant, in which case it replaces the grid view of that same participant or location feed for only the specified time, to maintain story continuity.
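
The multi-grid replacement rule might be sketched as follows; the tile records, tags and field names are assumptions made for illustration only.

    # Hypothetical multi-grid behaviour: the new priority video replaces the least
    # active tile, unless it is tagged to a specific participant or location, in
    # which case it temporarily takes over that participant's or location's tile.
    def place_priority_video(tiles: list, new_video: dict) -> list:
        tagged = new_video.get("related_to")
        if tagged is not None and any(t["source"] == tagged for t in tiles):
            target = next(t for t in tiles if t["source"] == tagged)
        else:
            target = min(tiles, key=lambda t: t["activity"])  # least active tile
        target["replacement"] = new_video
        target["expires_in_s"] = new_video["duration_s"]      # reverts after this interval
        return tiles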

In a broadcast or presentation experience, the audience consuming the broadcast or presentation may have less awareness of the reactions of the other users consuming the broadcast or presentation, whilst e.g. the presenting user may have more awareness of the reactions of the users consuming his/her presentation.

Generally, unless otherwise indicated, any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), or a combination of these implementations. The terms “module,” “functionality,” “component” and “logic” as used herein generally represent software, firmware, hardware, or a combination thereof. In the case of a software implementation, the module, functionality, or logic represents program code that performs specified tasks when executed on a processor (e.g. CPU or CPUs). The program code can be stored in one or more computer readable memory devices. The features of the techniques described below are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.

For example, the user terminals may also include an entity (e.g. software) that causes hardware of the user terminals to perform operations, e.g., processors, functional blocks, and so on. For example, the user terminals may include a computer-readable medium that may be configured to maintain instructions that cause the user terminals, and more particularly the operating system and associated hardware of the user terminals, to perform operations. Thus, the instructions function to configure the operating system and associated hardware to perform the operations and in this way result in transformation of the operating system and associated hardware to perform functions. The instructions may be provided by the computer-readable medium to the user terminals through a variety of different configurations.

One such configuration of a computer-readable medium is a signal bearing medium and thus is configured to transmit the instructions (e.g. as a carrier wave) to the computing device, such as via a network. The computer-readable medium may also be configured as a computer-readable storage medium and thus is not a signal bearing medium. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions and other data.

According to a first aspect, the subject matter of the present application provides a computer-implemented method of effecting a communication event between a first user and one or more second users via a communication network, the method comprising implementing on a computer connected to the network: receiving, via the network, a plurality of video streams, each carrying a respective moving image of at least one respective user; causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying, in the respective moving image of a second of the video streams that is not displayed at the user device in the first time interval, a human feature of the respective user; detecting a movement of the identified human feature during the first time interval that matches one of a plurality of expected movements; and in response to the detected movement, causing the respective moving image of at least the second video stream to be displayed at the user device for a second time interval.

In embodiments, the computer may determine the duration of the second time interval based on which of the plurality of expected movements the movement of the identified human feature is detected as matching.

Each of the plurality of expected movements may be associated with a priority value, and the computer may use the priority value to select the second stream from the plurality of video streams for said displaying at the user device for the second time interval.

The computer may be embodied in a server.

Alternatively, the computer may be embodied in the user device.

In some embodiments, causing at least the second video stream to be displayed at the user device may comprise replacing the first video stream with the second video stream, such that the first video stream is not displayed at the user device for the second interval.

In other embodiments, both the first and second video streams may be displayed at the user device for the second interval.

The computer may be separate from the user device, and the computer may cause the moving image of each of the first and second video streams to be displayed at the user device by transmitting that stream to the user device via the network for displaying thereat.

In further embodiments, a third video stream may be displayed at the user device in the first time interval in addition to the first video stream, and the third video stream may be replaced with the second video stream for the second interval, such that the third video stream is not displayed at the user device for the second interval.

The computer-implemented method of the first aspect may also include: in response to detecting said movement, selecting a first of a plurality of predetermined layouts for displaying at least the second video stream at the user device for the second time interval, wherein each of the plurality of predetermined layouts is for displaying a different number of video streams at the user device, wherein a different one of the predetermined layouts is used to display the first stream in the first time interval.

In some embodiments, the computer-implemented method may cause audio data associated with the first video stream to be played out at the user device during both the first and the second time intervals. The audio data may be played out in the first and second time intervals in response to the computer detecting that the user in the moving image of the first video stream is speaking.

In further embodiments, at least two of the plurality of streams may be received from different communication client instances, each of the different communication client instances being executed at a different user device. Each of the video streams may be received from a different communication client instance executed on a different user device.

According to a second aspect, the subject-matter of the present application provides a computer for effecting a communication event between a first user and one or more second users via a communication network, the computer comprising: a network interface configured to receive, via the network, a plurality of video streams, each carrying a respective moving image of one or more users; a processor configured to perform operations of: causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying, in the respective moving image of a second of the video streams that is not displayed at the user device in the first time interval, a human feature of the respective user; detecting a movement of the identified human feature during the first time interval that matches one of a plurality of expected movements; and in response to the detected movement, causing the respective moving image of at least the second video stream to be displayed at the user device for a second time interval.

The computer may determine the duration of the second time interval based on which of the plurality of expected movements the movement of the identified human feature is detected as matching.

Each of the plurality of expected movements is associated with a priority value, and the computer uses the priority value to select the second stream from the plurality of video streams for said displaying at the user device for the second time interval.

At least one of the plurality of expected movements may correspond to the user in the moving image of the second video stream: smiling, frowning, laughing, gasping, nodding their head, shaking their head, pointing in a particular direction with one or both of their hands, waving with one or both of their hands, raising or lowering one or both of their arms above or below a predetermined height, clapping, moving one or more clenched fists (i.e. so as to indicate celebration or frustration), and giving a thumbs up or down with one or both of their hands.

The computer of the second aspect may also include a processor configured to perform the operation of: in response to detecting said movement, selecting a first of a plurality of predetermined layouts for displaying at least the second video stream at the user device for the second time interval, wherein each of the plurality of predetermined layouts is for displaying a different number of video streams at the user device, wherein a different one of the predetermined layouts is used to display the first stream in the first time interval.

According to a third aspect, the subject-matter of the present application provides a computer program product for effecting a communication event between a first user and one or more second users via a communication network, the computer program product comprising code stored on a computer readable storage medium and configured when executed on a computer to perform the following operations: receiving, via the network, a plurality of video streams, each carrying a respective moving image of one or more users; causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying in the respective moving image of a second of the video streams that is not displayed at the user device in the first time interval, a human feature of the respective user; detecting a movement of the identified human feature during the first time interval that matches one of a plurality of expected movements; and in response to the detected movement, causing at least the second video stream to be displayed at the user device for a second time interval.

In embodiments of any one of the above aspects, features of any embodiment of any other of the aspects may be implemented.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

The invention claimed is:
1. A computer-implemented method of effecting a communication event between a first user and one or more second users via a communication network, the method comprising: receiving, via the communication network, a plurality of video streams, each of the video streams carrying a respective moving image of at least one respective user; causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying in the respective moving images of one or more of the other video streams a human feature of the respective user; detecting a movement of one or more of the identified human features during the first time interval that matches one of a plurality of expected movements, each of the plurality of expected movements associated with a priority value; selecting one of the one or more of the other video streams based on the priority value associated with the expected movements; and causing the respective moving image of the selected video stream to be displayed at the user device for a second time interval.
2. The method of claim 1, wherein a duration of the second time interval is determined based on which of the plurality of expected movements the movement of the identified human feature is detected as matching.
3. The method of claim 1, wherein causing the selected video stream to be displayed at the user device comprises replacing the first video stream with the selected video stream, such that the first video stream is not displayed at the user device for the second interval.
4. The method of claim 1, wherein both the first and selected video streams are displayed at the user device for the second interval.
5. The method of claim 1, wherein a second video stream is displayed at the user device in the first time interval in addition to the first video stream and the second video stream is replaced with the selected video stream for the second interval, such that the second video stream is not displayed at the user device for the second interval.
6. The method of claim 1, comprising, in response to detecting said movement, selecting a first of a plurality of predetermined layouts for displaying at least the selected video stream at the user device for the second time interval, wherein each of the plurality of predetermined layouts is for displaying a different number of video streams at the user device, wherein a different one of the predetermined layouts is used to display the first stream in the first time interval.
7. The method of claim 1, comprising causing audio data associated with the first video stream to be played out at the user device during both the first and the second time intervals.
8. The method of claim 7, wherein the audio data is played out in the first and second time intervals in response to detecting that the user in the moving image of the first video stream is speaking.
9. The method of claim 1, wherein at least two of the plurality of streams are received from different communication client instances, each of the different communication client instances being executed at a different user device.
10. The method of claim 9, wherein each of the video streams is received from a different communication client instance executed on a different user device.
11. The method of claim 1, wherein a computer separate from the user device causes the moving image of each of the first and selected video streams to be displayed at the user device, by transmitting that stream to the user device via the network for displaying thereat.
12. The method of claim 11, wherein the computer is embodied in a server.
13. The method of claim 1, wherein the plurality of expected movements comprises at least one of smiling, frowning, laughing, gasping, a head nod, a head shake, pointing in a particular direction with one or both hands, waving with one or both hands, raising or lowering one or both arms above or below a predetermined height, clapping, moving one or more clenched fists, or giving a thumbs up or down with one or both hands.
14. A computer for effecting a communication event between a first user and one or more second users via a communication network, the computer comprising: a network interface configured to receive, via the communication network, a plurality of video streams, each of the video streams carrying a respective moving image of at least one respective user; a processor configured to perform operations of: causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying in the respective moving images of one or more of the other video streams a human feature of the respective user; detecting a movement of one or more of the identified human features during the first time interval that matches one of a plurality of expected movements, each of the plurality of expected movements associated with a priority value; and causing the respective moving image of the selected video stream to be displayed at the user device for a second time interval.
15. The computer of claim 14, wherein the computer determines a duration of the second time interval based on which of the plurality of expected movements the movement of the identified human feature is detected as matching.
16. The computer of claim 14, wherein the processor is further configured to perform operations comprising, in response to detecting said movement, selecting a first of a plurality of predetermined layouts for displaying at least the selected video stream at the user device for the second time interval, wherein each of the plurality of predetermined layouts is for displaying a different number of video streams at the user device, wherein a different one of the predetermined layouts is used to display the first stream in the first time interval.
17. The computer of claim 14, wherein the plurality of expected movements comprises at least one of smiling, frowning, laughing, gasping, a head nod, a head shake, pointing in a particular direction with one or both hands, waving with one or both hands, raising or lowering one or both arms above or below a predetermined height, clapping, moving one or more clenched fists, or giving a thumbs up or down with one or both hands.
18. A computer program product for effecting a communication event between a first user and one or more second users via a communication network, the computer program product comprising code stored on a computer readable storage medium and configured when executed on a computer to perform the following operations: receiving, via the communication network, a plurality of video streams, each of the video streams carrying a respective moving image of at least one respective user; causing the respective moving image of a first of the video streams to be displayed at a user device of the first user for a first time interval; identifying in the respective moving images of one or more of the other video streams a human feature of the respective user; detecting a movement of one or more of the identified human features during the first time interval that matches one of a plurality of expected movements, each of the plurality of expected movements associated with a priority value; and causing the respective moving image of the selected video stream to be displayed at the user device for a second time interval.
19. The computer program product of claim 18, wherein a duration of the second time interval is determined based on which of the plurality of expected movements the movement of the identified human feature is detected as matching.
20. The computer program product of claim 18, wherein the plurality of expected movements comprises at least one of smiling, frowning, laughing, gasping, a head nod, a head shake, pointing in a particular direction with one or both hands, waving with one or both hands, raising or lowering one or both arms above or below a predetermined height, clapping, moving one or more clenched fists, or giving a thumbs up or down with one or both hands.